Developing competitive HMM PoS taggers using small training corpora
This paper presents a study aiming to find out the best strategy to
develop a fast and accurate HMM tagger when only a limited amount of
training material is available. This is a crucial factor when dealing
with languages for which annotated material is not easily available.
First, we run some experiments in English, using the WSJ corpus as a
test-bench to establish the differences caused by the use of a large or
a small training set. Then, we port the results to develop an accurate
Spanish PoS tagger using a limited amount of training data.
Different configurations of an HMM tagger are studied. Namely,
trigram and 4-gram models are tested, as well as different
smoothing techniques. The performance of each configuration depending
on the size of the training corpus is tested in order to determine the
most appropriate setting to develop HMM PoS taggers for languages
with a reduced amount of corpus data available.
Postprint (published version)
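The abstract above mentions trigram models combined with smoothing but does not say which smoothing techniques were compared. As a minimal sketch, assuming linear interpolation of unigram, bigram and trigram estimates (one common choice for sparse training data, not necessarily the one used in the paper), the tag transition probabilities of a trigram HMM could be estimated as follows. The function names, the `<s>` start symbol and the lambda weights are illustrative assumptions.

```python
from collections import Counter

START = "<s>"  # hypothetical sentence-start padding tag


def train_counts(tagged_sentences):
    """Collect tag n-gram counts from sentences given as lists of (word, tag)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    ctx1, ctx2 = Counter(), Counter()  # counts of 1-tag and 2-tag histories
    for sentence in tagged_sentences:
        tags = [START, START] + [tag for _, tag in sentence]
        for i in range(2, len(tags)):
            t1, t2, t3 = tags[i - 2], tags[i - 1], tags[i]
            uni[t3] += 1
            bi[(t2, t3)] += 1
            tri[(t1, t2, t3)] += 1
            ctx1[t2] += 1
            ctx2[(t1, t2)] += 1
    return uni, bi, tri, ctx1, ctx2


def transition_prob(t1, t2, t3, counts, lambdas=(0.1, 0.3, 0.6)):
    """Interpolated estimate of P(t3 | t1, t2).

    The lambdas are assumed fixed weights that sum to 1; in practice they
    would be tuned, for example on held-out data.
    """
    uni, bi, tri, ctx1, ctx2 = counts
    l1, l2, l3 = lambdas
    total = sum(uni.values())
    p1 = uni[t3] / total if total else 0.0
    p2 = bi[(t2, t3)] / ctx1[t2] if ctx1[t2] else 0.0
    p3 = tri[(t1, t2, t3)] / ctx2[(t1, t2)] if ctx2[(t1, t2)] else 0.0
    return l1 * p1 + l2 * p2 + l3 * p3


corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN")],
]
counts = train_counts(corpus)
print(transition_prob(START, "DT", "NN", counts))
```

With a small corpus most trigrams are unseen, so the lower-order terms keep the estimate above zero for any tag that occurs at all in training.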
A Machine learning approach to POS tagging
We have applied inductive learning of statistical decision trees
and relaxation labelling to the Natural Language Processing (NLP)
task of morphosyntactic disambiguation (Part Of Speech Tagging).
The learning process is supervised and obtains a language
model oriented to resolve POS ambiguities. This model consists
of a set of statistical decision trees expressing distribution of
tags and words in some relevant contexts.
The acquired language models are complete enough to be directly
used as sets of POS disambiguation rules, and include more complex
contextual information than simple collections of n-grams usually
used in statistical taggers.
We have implemented a quite simple and fast tagger that has been
tested and evaluated on the Wall Street Journal (WSJ) corpus with
a remarkable accuracy.
However, better results can be obtained by translating the trees
into rules to feed a flexible relaxation labelling based tagger.
In this direction we describe a tagger which is able to use
information of any kind (n-grams, automatically acquired constraints,
linguistically motivated manually written constraints, etc.), and in
particular to incorporate the machine learned decision trees.
Simultaneously, we address the problem of tagging when only
a small amount of training material is available, which is crucial in any
process of constructing, from scratch, an annotated corpus. We show that quite
high accuracy can be achieved with our system in this situation.
Postprint (published version)
Visual Re-ranking with Natural Language Understanding for Text Spotting
Many scene text recognition approaches are based on purely visual information
and ignore the semantic relation between scene and text. In this paper, we
tackle this problem from a natural language processing perspective to fill the
gap between language and vision. We propose a post-processing approach to
improve scene text recognition accuracy by using occurrence probabilities of
words (unigram language model), and the semantic correlation between scene and
text. For this, we initially rely on an off-the-shelf deep neural network,
already trained with a large amount of data, which provides a series of text
hypotheses per input image. These hypotheses are then re-ranked using word
frequencies and semantic relatedness with objects or scenes in the image. As a
result of this combination, the performance of the original network is boosted
with almost no additional cost. We validate our approach on the ICDAR'17 dataset.
Comment: Accepted by ACCV 2018. arXiv admin note: substantial text overlap
with arXiv:1810.0977
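The re-ranking step described in the abstract above can be sketched as follows. This is a minimal illustration, assuming the three signals (recognizer score, unigram probability, semantic relatedness to the visual context) are combined log-linearly; the abstract does not state the actual combination. The `rerank` function, its weights, and the toy relatedness lookup are all hypothetical.

```python
import math


def rerank(hypotheses, unigram_prob, relatedness, context,
           alpha=1.0, beta=1.0, floor=1e-9):
    """Re-rank (word, visual_score) hypotheses for one image.

    unigram_prob: dict mapping a word to its occurrence probability.
    relatedness:  callable (word, context) -> score in [0, 1].
    context:      label of an object or scene detected in the image.
    alpha, beta:  assumed weights for the language and semantic terms.
    floor:        small constant that avoids log(0) for unseen words.
    """
    scored = []
    for word, visual_score in hypotheses:
        lm = unigram_prob.get(word, floor)
        rel = max(relatedness(word, context), floor)
        score = (math.log(max(visual_score, floor))
                 + alpha * math.log(lm)
                 + beta * math.log(rel))
        scored.append((score, word))
    scored.sort(reverse=True)  # best combined score first
    return [word for _, word in scored]


# Toy data: the recognizer slightly prefers a misspelling.
hypotheses = [("dclta", 0.5), ("delta", 0.4)]
unigram_prob = {"delta": 1e-5}
toy_relatedness = {("delta", "airplane"): 0.8}


def lookup(word, context):
    return toy_relatedness.get((word, context), 0.0)


print(rerank(hypotheses, unigram_prob, lookup, "airplane"))
```

Because the recognizer is treated as a black box that only supplies ranked hypotheses, a scheme of this shape adds little cost on top of the original network.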
Visual Semantic Re-ranker for Text Spotting
Many current state-of-the-art methods for text recognition are based on
purely local information and ignore the semantic correlation between text and
its surrounding visual context. In this paper, we propose a post-processing
approach to improve the accuracy of text spotting by using the semantic
relation between the text and the scene. We initially rely on an off-the-shelf
deep neural network that provides a series of text hypotheses for each input
image. These text hypotheses are then re-ranked using the semantic relatedness
with the object in the image. As a result of this combination, the performance
of the original network is boosted with a very low computational cost. The
proposed framework can be used as a drop-in complement for any text-spotting
algorithm that outputs a ranking of word hypotheses. We validate our approach
on the ICDAR'17 shared task dataset.
Visual re-ranking with natural language understanding for text spotting
The final publication is available at link.springer.com
Many scene text recognition approaches are based on purely visual information and ignore the semantic relation between scene and text. In this paper, we tackle this problem from a natural language processing perspective to fill the gap between language and vision. We propose a post-processing approach to improve scene text recognition accuracy by using occurrence probabilities of words (unigram language model), and the semantic correlation between scene and text. For this, we initially rely on an off-the-shelf deep neural network, already trained with a large amount of data, which provides a series of text hypotheses per input image. These hypotheses are then re-ranked using word frequencies and semantic relatedness with objects or scenes in the image. As a result of this combination, the performance of the original network is boosted with almost no additional cost. We validate our approach on the ICDAR'17 dataset.
Peer Reviewed
Postprint (author's final draft)
Semantic relatedness based re-ranker for text spotting
Applications such as textual entailment, plagiarism detection or document clustering rely on the notion of semantic similarity, and are usually approached with dimension reduction techniques like LDA or with embedding-based neural approaches. We present a scenario where semantic similarity is not enough, and we devise a neural approach to learn semantic relatedness. The scenario is text spotting in the wild, where a text in an image (e.g. street sign, advertisement or bus destination) must be identified and recognized. Our goal is to improve the performance of vision systems by leveraging semantic information. Our rationale is that the text to be spotted is often related to the image context in which it appears (word pairs such as Delta–airplane, or quarters–parking are not similar, but are clearly related). We show how learning a word-to-word or word-to-sentence relatedness score can improve the performance of text spotting systems by up to 2.9 points, outperforming other measures in a benchmark dataset.
Peer Reviewed
Postprint (author's final draft)
Experiments on applying relaxation labeling to map multilingual hierarchies
This paper explores the automatic construction of a multilingual
Lexical Knowledge Base from preexisting lexical resources. It
presents a new approach for linking already existing hierarchies. The
relaxation labeling algorithm is used to select, among all the
candidate connections proposed by a bilingual dictionary, the right
connection for each node in the taxonomy.
Postprint (published version)
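To illustrate the selection step described above, here is a minimal sketch of a generic relaxation labeling loop. It assumes the standard multiplicative update with support values clipped to [-1, 1]; the support function and constraints actually used in the paper are not given in the abstract. The node names, candidate labels and compatibility value are hypothetical.

```python
def relax(weights, compat, iterations=50, tol=1e-6):
    """Generic relaxation labeling.

    weights: {node: {label: weight}}, weights of each node summing to 1.
    compat:  {((node, label), (other_node, other_label)): value in [-1, 1]},
             how strongly the second assignment supports the first.
    """
    for _ in range(iterations):
        new_weights = {}
        max_change = 0.0
        for node, labels in weights.items():
            support = {}
            for label in labels:
                s = 0.0
                for ((n, l), (on, ol)), value in compat.items():
                    if n == node and l == label:
                        s += value * weights.get(on, {}).get(ol, 0.0)
                support[label] = max(-1.0, min(1.0, s))
            norm = sum(labels[l] * (1.0 + support[l]) for l in labels)
            if norm == 0.0:
                new_weights[node] = dict(labels)  # no evidence: keep as is
                continue
            updated = {l: labels[l] * (1.0 + support[l]) / norm for l in labels}
            max_change = max(max_change,
                             max(abs(updated[l] - labels[l]) for l in labels))
            new_weights[node] = updated
        weights = new_weights
        if max_change < tol:  # stop once the weights have stabilised
            break
    return weights


# Hypothetical example: a Spanish node with two candidate English
# connections; its child already maps unambiguously to "dog_en".
weights = {
    "animal_es": {"animal_en": 0.5, "beast_en": 0.5},
    "perro_es": {"dog_en": 1.0},
}
compat = {(("animal_es", "animal_en"), ("perro_es", "dog_en")): 0.5}
result = relax(weights, compat)
print(max(result["animal_es"], key=result["animal_es"].get))
```

The candidate supported by a neighbouring node's assignment gains weight at each iteration, so the ambiguous node converges toward the connection that is consistent with the rest of the hierarchy.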
FreeLing 3.0: Towards Wider Multilinguality
FreeLing is an open-source multilingual language processing library providing a wide range of analyzers for several languages. It
offers text processing and language annotation facilities to NLP application developers, lowering the cost of building those applications.
FreeLing is customizable, extensible, and has a strong orientation to real-world applications in terms of speed and robustness.
Developers can use the default linguistic resources (dictionaries, lexicons, grammars, etc.), extend/adapt them to specific domains, or –since the library is open source– develop new ones for specific languages or special application needs. This paper describes the general architecture of the library, presents the major changes and improvements included in FreeLing version 3.0, and summarizes some relevant industrial projects in which it has been used.
Postprint (published version)
FreeLing: From a multilingual open-source analyzer suite to an EBMT platform.
FreeLing is an open-source library providing a wide range of language analysis utilities for
several different languages. It is intended to provide NLP application developers with any text
processing and language annotation tools they may need in order to simplify their development
task.
Moreover, FreeLing is customizable and extensible. Developers can use the default linguistic
resources (dictionaries, lexicons, grammars, etc.), extend them, adapt them to particular domains, or
even develop new resources for specific languages.
Being open-source has enabled FreeLing to grow far beyond its original capabilities,
especially with regard to linguistic data: contributions from its community of users, for instance,
include morphological dictionaries and PoS tagger training data for Galician, Italian, Portuguese,
Asturian, and Welsh.
In this paper we present the basic architecture and the main services in FreeLing, and we
outline how developers might use it to build competitive NLP systems and indicate how it might
be extended to support the development of Example-Based Machine Translation systems.
Postprint (published version)
Analizadores Multilingües en FreeLing (Multilingual Analyzers in FreeLing)
FreeLing is an open-source library for automatic multilingual processing, providing a wide range of linguistic analysis services for several languages. FreeLing offers developers of Natural Language Processing applications text analysis and linguistic annotation functions, thereby reducing the cost of building those applications. FreeLing is customizable and extensible, and is strongly oriented to real-world applications in terms of speed and robustness. Developers can use the default linguistic resources (dictionaries, lexicons, grammars, etc.), extend them, adapt them to particular domains, or, since the library is open source, develop new ones for specific languages or special application needs. This article presents the main changes and improvements included in FreeLing version 3.0, and summarizes some relevant industrial projects in which it has been used.
Postprint (published version)